1  Univariate Viz

Use this file for practice with the univariate viz in-class activity. Refer to the class website for details.

2 Example 1: Done for HW

# Import data
hikes <- read.csv("https://mac-stat.github.io/data/high_peaks.csv")
  • As currently structured, the data frame appears to be sorted so that the hikes with the highest elevations are at the top. At first glance, elevation does not appear to be associated with difficulty or rating.

  • At first glance, there does not appear to be a relationship between a hike’s time and its elevation.

3 Example 2: Done for HW

The story would be much more challenging to interpret, and I would most likely have more questions.

4 Exercise 1: Research Questions

# Preview what the data looks like
head(hikes)
             peak elevation difficulty ascent length time    rating
1     Mt. Marcy        5344          5   3166   14.8 10.0  moderate
2 Algonquin Peak       5114          5   2936    9.6  9.0  moderate
3   Mt. Haystack       4960          7   3570   17.8 12.0 difficult
4   Mt. Skylight       4926          7   4265   17.9 15.0 difficult
5 Whiteface Mtn.       4867          4   2535   10.4  8.5      easy
6       Dix Mtn.       4857          5   2800   13.2 10.0  moderate
  1. I would like a visualization to capture how many hikes fall into each rating category, that is, the distribution of the rating variable.

  2. I would like a visualization to capture the typical elevation, how much elevation varies from hike to hike, and which elevations are present.

5 Exercise 2: Load tidyverse

# Use the ggplot function
# ggplot(hikes, aes(x = rating))

The error message tells us that R could not find the ggplot() function, because the package that provides it has not been loaded yet.
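This kind of message can be reproduced without ggplot at all. The sketch below is only an illustration: `not_a_loaded_function` is a made-up name, chosen because nothing defines it, and `tryCatch()` captures the error text instead of stopping the script.

```r
# Calling a function that no loaded package defines raises an error.
# tryCatch() captures the error so its message can be inspected.
# NOTE: not_a_loaded_function is a hypothetical name used for illustration.
msg <- tryCatch(
  not_a_loaded_function(1),
  error = function(e) conditionMessage(e)
)

# The captured text names the function R could not find
print(msg)
```

Loading the package that defines the function, as done with library(tidyverse) in this exercise, is what makes the error go away.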

# Load the package
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

6 Exercise 3: Bar Chart Ratings - Part 1

ggplot(hikes, aes(x = rating))

This draws the frame of the data viz: it sets up the scale of the x-axis and labels it, but nothing is plotted yet. The first argument is the data frame the plot pulls from, and x = rating maps the rating variable to the x-axis. I think aes stands for aesthetics, which describe how variables are mapped to the visual properties of the data visualization.
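One way to see that nothing has been plotted yet is to count the layers stored in the plot object. This is a minimal sketch, assuming the ggplot2 package is installed; `toy_hikes` is a hypothetical stand-in that copies only the six ratings printed by head(hikes), and inspecting `$layers` relies on how ggplot2 stores plots internally, so treat it as a peek rather than a documented interface.

```r
library(ggplot2)

# Toy stand-in for the hikes data (ratings from the six rows shown by head())
toy_hikes <- data.frame(
  rating = c("moderate", "moderate", "difficult", "difficult", "easy", "moderate")
)

# Only data and an aesthetic mapping: no geom layer has been added
p_frame <- ggplot(toy_hikes, aes(x = rating))
n_layers_frame <- length(p_frame$layers)

# Adding geom_bar() appends one layer to the same frame
p_bars <- p_frame + geom_bar()
n_layers_bars <- length(p_bars$layers)

c(frame = n_layers_frame, bars = n_layers_bars)
```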

7 Exercise 4: Bar Chart Ratings - Part 2

# geom_bar() specifies the type of plot, adding a bar layer to the visualization
ggplot(hikes, aes(x = rating)) +
  geom_bar()

# Added labels to the data viz
ggplot(hikes, aes(x = rating)) +
  geom_bar() +
  labs(x = "Rating", y = "Number of hikes")

# fill changes the color of the bars
ggplot(hikes, aes(x = rating)) +
  geom_bar(fill = "blue") +
  labs(x = "Rating", y = "Number of hikes")

# color outlines the bars in orange
ggplot(hikes, aes(x = rating)) +
  geom_bar(color = "orange", fill = "blue") +
  labs(x = "Rating", y = "Number of hikes")

# theme_minimal() changes the theme, particularly the background
ggplot(hikes, aes(x = rating)) +
  geom_bar(color = "orange", fill = "blue") +
  labs(x = "Rating", y = "Number of hikes") +
  theme_minimal()

8 Exercise 5: Bar Chart Follow-up

  1. The + adds layers and customizations to the plot. I think geom stands for geometric, since we are adding geometric elements (bars). labs() stands for labels and adds labels to the data viz that you can modify. color outlines the bars, and fill sets the color inside them.

  2. From this visualization I learn that a lot of the hikes fall into the moderate rating category, and that the fewest fall into the difficult category.

  3. I do not like that the category labels are lowercase. I also do not like that the bars are outlined in orange.
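The two dislikes in the last item could be addressed along these lines. This is a sketch, not the activity's official solution: `toy_hikes` is a hypothetical stand-in holding only the six ratings printed by head(hikes), and it assumes the rating values are exactly "easy", "moderate", and "difficult" (any other value would become NA in the factor).

```r
library(ggplot2)

# Toy stand-in for the hikes data (ratings from the six rows shown by head())
toy_hikes <- data.frame(
  rating = c("moderate", "moderate", "difficult", "difficult", "easy", "moderate")
)

# Put the categories in a meaningful order and capitalize the labels
toy_hikes$rating <- factor(
  toy_hikes$rating,
  levels = c("easy", "moderate", "difficult"),
  labels = c("Easy", "Moderate", "Difficult")
)

# Counts per category: these are the bar heights that geom_bar() draws
rating_counts <- table(toy_hikes$rating)
print(rating_counts)

# Same bar chart as before, without the orange outline
ggplot(toy_hikes, aes(x = rating)) +
  geom_bar(fill = "blue") +
  labs(x = "Rating", y = "Number of hikes") +
  theme_minimal()
```

Converting rating to a factor also controls the left-to-right order of the bars, which otherwise defaults to alphabetical.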

9 Exercise 6: Sad Bar Chart

# A bar chart of elevation, a quantitative variable
ggplot(hikes, aes(x = elevation)) +
  geom_bar(fill = "blue") +
  labs(x = "Elevation", y = "Number of hikes") +
  theme_minimal()

This is not an effective visualization. It does show the range of elevations and the number of hikes at each exact elevation. However, because it draws a separate bar for every distinct elevation, it is very noisy and does not communicate the overall pattern clearly.
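The noisiness can be checked numerically by comparing the number of values with the number of distinct values. The sketch below uses only the six elevations printed by head(hikes) as a toy stand-in, so it illustrates the idea rather than describing the full data set.

```r
# Toy stand-in: elevations from the six rows printed by head(hikes)
elevation <- c(5344, 5114, 4960, 4926, 4867, 4857)

# A bar chart draws one bar per distinct value
n_values   <- length(elevation)
n_distinct <- length(unique(elevation))
c(values = n_values, distinct = n_distinct)

# Here every elevation occurs once, so every bar would have height 1
elevation_counts <- table(elevation)
max(elevation_counts)
```

When almost every value is unique, a bar chart mostly shows that the values differ, which is why grouping values into ranges is more useful for a quantitative variable.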

10 Exercise 7: A Histogram of Elevation

Part a: There are about 6 hikes with elevations between 4500 and 4700 feet, and I think 2 hikes have an elevation of at least 5100 feet.

Part b: From this histogram we are able to learn a lot about the elevations of the hikes in the Adirondacks. It appears that the majority of the hikes fall in an elevation range of roughly 1000 feet, from 4000 to 5000 feet, but there are some outliers at the ends of the tails. The minimum elevation is around 3750 feet and the maximum is around 5500 feet. The average elevation of a hike is roughly 4250 feet. The graph looks a little right-skewed, but nothing extreme.
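Readings taken from a histogram can be checked by counting directly. The sketch below shows the idea on the six elevations printed by head(hikes); `count_between` is a hypothetical helper, and because the toy vector is not the full data set, its counts are not expected to match the histogram readings for other ranges.

```r
# Toy stand-in: elevations from the six rows printed by head(hikes)
elevation <- c(5344, 5114, 4960, 4926, 4867, 4857)

# Hypothetical helper: count values in the half-open range [lower, upper)
count_between <- function(x, lower, upper) {
  sum(x >= lower & x < upper)
}

# Hikes from 4800 feet up to (but not including) 5000 feet
n_mid <- count_between(elevation, 4800, 5000)

# Hikes with an elevation of at least 5100 feet
n_high <- sum(elevation >= 5100)

c(from_4800_to_5000 = n_mid, at_least_5100 = n_high)
```

Running the same two lines on hikes$elevation would give the exact counts for the full data.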

11 Exercise 8: Building Histograms - Part 1

ggplot(hikes, aes(x = elevation)) + 
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

12 Exercise 9: Building Histograms - Part 2

# geom_histogram() plots a histogram
ggplot(hikes, aes(x = elevation)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# color outlines each bar in white, which separates adjacent bars
ggplot(hikes, aes(x = elevation)) +
  geom_histogram(color = "white")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# fill changes the bar color to blue
ggplot(hikes, aes(x = elevation)) +
  geom_histogram(color = "white", fill = "blue")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# labs() edits the axis labels
ggplot(hikes, aes(x = elevation)) +
  geom_histogram(color = "white") +
  labs(x = "Elevation (feet)", y = "Number of hikes")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# binwidth sets the width of each bin (here, very wide)
ggplot(hikes, aes(x = elevation)) +
  geom_histogram(color = "white", binwidth = 1000) +
  labs(x = "Elevation (feet)", y = "Number of hikes")

# A much smaller binwidth makes the bins very narrow
ggplot(hikes, aes(x = elevation)) +
  geom_histogram(color = "white", binwidth = 5) +
  labs(x = "Elevation (feet)", y = "Number of hikes")

# A binwidth of 200 is easier to read and clearer
ggplot(hikes, aes(x = elevation)) +
  geom_histogram(color = "white", binwidth = 200) +
  labs(x = "Elevation (feet)", y = "Number of hikes")

13 Exercise 10: Histogram Follow-up

The function geom_histogram() adds the histogram layer, with fill setting the bar color and color setting the outline that separates adjacent bars. Outlining the bars makes it easier to tell them apart. binwidth sets the range of elevations covered by each bin. If the bins are too wide or too narrow, you do not learn as much from the plot and lose the information you want to communicate.
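The effect of the bin width can be seen by counting values per bin by hand. This is a rough sketch on the six elevations printed by head(hikes); `bin_counts` is a hypothetical helper whose bins start at a multiple of the width, which is an illustrative choice and not necessarily where geom_histogram() places its bin edges.

```r
# Toy stand-in: elevations from the six rows printed by head(hikes)
elevation <- c(5344, 5114, 4960, 4926, 4867, 4857)

# Hypothetical helper: count values per bin for a given bin width.
# Bins are half-open, [lower, upper), and start at a multiple of the width.
bin_counts <- function(x, width) {
  lower  <- floor(min(x) / width) * width
  upper  <- (floor(max(x) / width) + 1) * width
  breaks <- seq(lower, upper, by = width)
  table(cut(x, breaks = breaks, right = FALSE, dig.lab = 5))
}

wide   <- bin_counts(elevation, 1000)  # very wide bins: little detail
medium <- bin_counts(elevation, 200)   # moderate bins
narrow <- bin_counts(elevation, 5)     # very narrow bins: mostly empty

# Number of bins produced by each width
c(wide = length(wide), medium = length(medium), narrow = length(narrow))

# Counts for the moderate width
print(medium)
```

With very wide bins almost everything lands in one or two bars, and with very narrow bins nearly every bar has a count of 0 or 1, so neither extreme shows the shape of the distribution.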

14 Exercise 11: Density Plots

ggplot(hikes, aes(x = elevation)) +
  geom_density(fill = "blue")

From the density plot we learn about the typical elevations and the overall range of elevations. The finer shape of the distribution and the exact elevations are a bit trickier to see, because the smoothness of the density plot removes some of that information.
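The smoothing can be explored with base R's density() function, which computes a kernel density estimate. The sketch below is an illustration on the six elevations printed by head(hikes), not the full data; the adjust argument scales the bandwidth, and geom_density() accepts an adjust argument as well.

```r
# Toy stand-in: elevations from the six rows printed by head(hikes)
elevation <- c(5344, 5114, 4960, 4926, 4867, 4857)

# Kernel density estimate with the default bandwidth
d_default <- density(elevation)

# Doubling the bandwidth (adjust = 2) gives a smoother, flatter curve
d_smooth <- density(elevation, adjust = 2)

# The area under a density curve is approximately 1
# (Riemann sum over the evenly spaced evaluation grid)
area <- sum(d_default$y) * diff(d_default$x[1:2])
area

# More smoothing lowers the peak of the curve
peak_is_lower <- max(d_smooth$y) < max(d_default$y)
peak_is_lower
```

Because the y-axis is a density rather than a count, the height of the curve cannot be read as a number of hikes, which is part of why exact values are harder to recover than from a histogram.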

15 Exercise 12: Density Plots vs Histograms

I like density plots because the typical values and general patterns of the elevations are easier to see. Histograms, however, allow individual elevation ranges to be highlighted and isolated. In this context, having that information shown more clearly seems like a slightly better idea.

16 Exercise 13: Code = communication

The two examples of unindented code are super tricky to read and modify. Without line breaks and indentation, it is harder to identify each element and to see where to make changes.

17 Exercise 14

# Data on students in this class
survey <- read.csv("https://ajohns24.github.io/data/112/about_us_2024.csv")

# World Cup data
world_cup <- read.csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-11-29/worldcups.csv")